New molecular structure based models for estimation of the CO2 solubility in different choline chloride-based deep eutectic solvents (DESs)

In this study, CO2 solubility in different choline chloride-based deep eutectic solvents (DESs) has been investigated using the Quantitative Structure–Property Relationship (QSPR) method. In this regard, the effect of different hydrogen bond donor (HBD) structures in choline chloride (ChCl)-based DESs has been studied at different temperatures and different molar ratios of ChCl, as the hydrogen bond acceptor (HBA), to HBD. Twelve datasets with 390 data points on CO2 solubility were chosen from the literature for the model development. Eight predictive models, which contain the pressure and one structural descriptor, have been developed at a fixed temperature (i.e. 293, 303, 313, or 323 K) and a constant ChCl to HBD molar ratio of 1:3 or 1:4. Moreover, two models were introduced that consider the effects of pressure, temperature, and HBD structure simultaneously at molar ratios of 1:3 or 1:4. Two additional datasets were used only for further external validation of these two models at new temperatures, pressures, and HBD structures. It was found that CO2 solubility depends on the “EEig02d” descriptor of the HBD. “EEig02d” is a molecular descriptor derived from the edge adjacency matrix of a molecule weighted by dipole moments. This descriptor is also related to the molar volume of the structure. The statistical evaluation of the proposed models for the unfixed and fixed temperature datasets confirmed the validity of the developed models.

www.nature.com/scientificreports/ observed. Although they used a large number of descriptors for each component of the DESs (i.e. HBA and HBD), their linear model has limited prediction capability. Besides, their model was not descriptive because it relied on sigma profile descriptors, which are not interpretable. Furthermore, they used the molar ratio as an independent variable in their linear model, although the relationship between the HBA to HBD molar ratio and solubility is not linear (see Fig. S1 in the supporting information file). Therefore, the present study seeks the most important interpretable descriptor of the HBD in the presence of a fixed HBA (i.e. choline chloride). Kumar et al. 46 presented 12 QSPR models for predicting the CO2 capture capacity of DESs, considering the effects of the HBA and HBD structures, the HBA to HBD molar ratio, temperature, and pressure. The Monte Carlo method was used to determine the appropriate coefficients of each quasi-SMILES descriptor for 72 different DESs (including 19 different HBAs and 20 different HBDs). Their models included four random splits of the datasets as well as three target functions, with and without criteria of predictive potential (i.e. the index of ideality of correlation (IIC) and the correlation intensity index (CII)). They then introduced the model with the highest accuracy according to different statistical parameters. Although their work was comprehensive and valuable because of its diverse dataset and the high prediction accuracy of the model, the parameters of their model apparently cannot be interpreted, and the effect of each parameter on the CO2 absorption mechanism cannot be investigated. In other words, they appear to have paid more attention to the predictability of the model than to describing why and how each variable in the model affects the CO2 capture capacity.
Therefore, the present study aims to develop descriptive and predictive QSPR models with a meaningful and interpretable descriptor.
Halder et al. 47 applied multicriteria decision techniques to develop multi-objective models for investigating two properties (i.e. viscosity and CO2 uptake capacity) simultaneously. Their work is valuable because the viscosity of DESs plays a significant role in the final solvent choice. They developed two separate linear QSPR models for predicting the CO2 uptake capacity and the viscosity of DESs. Then, they used Derringer's desirability function to integrate these two models and identify DESs with high CO2 absorption capacity and low viscosity. Although their work was innovative and comprehensive, it has a few flaws. First, according to the MD simulation performed by Alizadeh et al. 48 , the HBD structure and the anion part of the HBA have a strong effect on CO2 solubility in DESs, whereas the cation part of the HBA has only a slight effect. Moreover, the HBD-CO2 interaction is dominant at lower pressure, and the anion-CO2 interaction is dominant at higher pressure. In other words, HBD structures have the greater effect on CO2 absorption at low pressures, and HBA structures at high pressures. However, Halder et al. 47 treated the contributions of the HBA (both cation and anion parts) and the HBD as the same under all conditions. Second, temperature and pressure were not included in their model, and the prediction was made only from structural variables, even though temperature and pressure have been shown to have a significant effect on CO2 absorption. Thus, the present study investigates the effect of HBD structures on CO2 solubility at low pressure (i.e. physical absorption) while including the key parameters of temperature and pressure in the developed model. In this way, this study attempts to fill the gaps observed in these recent valuable studies.
In this study, the QSPR method is applied as a robust tool to develop predictive models for the solubility of CO2 in DESs with a fixed HBA (i.e. ChCl) and an HBA to HBD molar ratio of 1:3 or 1:4. First, QSPR models are developed that consider the effect of the HBD structure and pressure at a fixed temperature (i.e. 293, 303, 313, or 323 K). Then, the dependence of CO2 solubility on temperature is considered along with the pressure and the HBD descriptor. This approach can efficiently predict the CO2 solubility of new ChCl-based DESs at new temperatures. Moreover, two additional datasets were used for further external validation to confirm the robustness of the unfixed temperature models.

The QSPR method
Dataset. The available experimental data on CO2 solubility in ChCl-based DESs with molar ratios of 1:3 and 1:4 were first collected from the literature. The ranges of P, T, and CO2 solubility for each dataset are shown in Table 1. The total number of CO2 solubility data points is 390. As can be seen in Table 1, nine different HBDs are involved in the DESs. In the present study, the values of CO2 solubility (x: mole of CO2 per mole of DES) were converted to their natural logarithm (i.e., ln(x)) for the model development. A common technique for ensuring the reliability of QSPR models is to divide the datasets into two separate sets called "train" and "test". The QSPR model is developed using the train set, to which the internal validation technique can be applied. The model is then externally validated by taking some HBDs out of the datasets and putting them into the test set, which allows the prediction capability and accuracy of the model to be assessed. To increase the robustness of the external validation, the test set was selected so that it contains some HBD structures that differ from those in the train set. In addition, datasets no. (11) and (12) were used for further external validation of the unfixed temperature models at new temperatures, pressures, and HBD structures. Furthermore, the applicability domain of the constructed models was checked for both the train and test sets, which indicates that both contain DESs with considerable differences from a molecular structure viewpoint.
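The data preparation described above (logarithmic transformation and a test set built from held-out HBD structures) can be sketched as follows. This is a minimal illustration, not the authors' workflow: the function names `split_by_hbd` and `to_log_space` are hypothetical, and the numerical values are placeholders rather than entries from Table 1.

```python
import math

# Hypothetical records: (HBD name, pressure in bar, solubility x in mol CO2 / mol DES).
# The numbers are placeholders for illustration only, not data from Table 1.
records = [
    ("Phenol", 2.0, 0.010),
    ("Phenol", 5.0, 0.024),
    ("Guaiacol", 2.0, 0.013),
    ("Guaiacol", 5.0, 0.031),
    ("Levulinic acid", 2.0, 0.012),
    ("Levulinic acid", 5.0, 0.029),
]


def split_by_hbd(rows, test_hbds):
    """Send every data point of the held-out HBDs to the test set."""
    train = [row for row in rows if row[0] not in test_hbds]
    test = [row for row in rows if row[0] in test_hbds]
    return train, test


def to_log_space(rows):
    """Return (HBD, ln P, ln x) tuples: the regression input and target."""
    return [(hbd, math.log(p), math.log(x)) for hbd, p, x in rows]


# Hold out one whole HBD so the test set contains a structure unseen in training.
train, test = split_by_hbd(records, test_hbds={"Guaiacol"})
train_log = to_log_space(train)
test_log = to_log_space(test)
```

Splitting by HBD rather than by individual data point means that the external validation probes new structures, which is the stated purpose of the test set in this work.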

Optimization of HBD structures and descriptors calculation.
Before calculating the descriptors of each HBD, it is essential to optimize its molecular structure. The 3D structures of the 9 HBD molecules were drawn using GaussView software 51 and then submitted to geometry optimization using density functional theory (DFT) at the B3LYP/6-31+G(d,p) level 52 . Afterward, Dragon software 53
Basic theory and model construction procedure.
Basic theory. CO2 solubility in gas-liquid systems (i.e. CO2 in DES) is defined as follows: According to Li et al. 24 , the CO2 solubility depends on the temperature and pressure as well as on the HBA to HBD molar ratio.
At a constant HBA to HBD molar ratio, the relationship between ln(x) and ln(P) can be described by Eq. (2) (see Fig. S2 in the supplementary file), where a and b represent the adjustable parameters. The molecular structure of HBDs can play a key role in different processes such as desulfurization 22 and CO2 solubility 20 . In this study, the QSPR method is used to correlate ln(x) with ln(P) and a relevant molecular descriptor of the HBDs by replacing the "b" parameter. To investigate the effect of the HBD molecular structure on CO2 solubility, eight separate fixed-temperature datasets were modeled with Eq. (3). Using Eq. (3), the CO2 solubility can be predicted only at a fixed temperature (i.e., 293, 303, 313, or 323 K). To take into account the effect of temperature along with the descriptor and ln(P), Eq. (4) was obtained by replacing the "c" parameter in Eq. (3) with a "b × T" term. According to the observed trend of CO2 solubility with temperature (see Fig. S3 in the supplementary materials), T was treated as a linear variable in the models that account for the effect of temperature.
Model development strategy. In the present study, two types of QSPR models have been developed. Equation (3) is applied to develop the models for the fixed temperature datasets. Equation (4) is applied to develop the models that take into account the effect of temperature on CO2 solubility. Using Eq. (4), a multiple linear regression (MLR) model with three variables (i.e., ln(P), T, and the molecular descriptor of the HBDs) was used to derive a predictive and descriptive QSPR model. It is important to note that the suitable HBD descriptor should be selected from a pool containing 444 HBD descriptors together with the ln(P) and T variables. Variable selection for QSPR models can be performed following several approaches 54 .
In this study, the Genetic Algorithm (GA) was applied to select the variables of the QSPR model. Further information on the genetic algorithm-multiple linear regression (GA-MLR) can be seen elsewhere 55,56 . It should be noted that the GA-MLR models were built using QSARINS software 57 .
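The MLR step for the unfixed temperature case can be illustrated with a minimal least-squares sketch. It assumes a linear model in ln(P), T, and one HBD descriptor plus an intercept; the exact functional form of the authors' Eq. (4) is not reproduced here, and the genetic-algorithm variable selection performed in QSARINS is not implemented. The coefficients in `true_coefficients` are hypothetical and are used only to generate synthetic targets; the four EEig02d values are those quoted in the Discussion of this paper.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * solution[k] for k in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_mlr(features, targets):
    """Ordinary least squares via the normal equations; the intercept is returned last."""
    design = [row + [1.0] for row in features]
    m = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(m)]
    return solve_linear_system(xtx, xty)


# EEig02d values quoted in this paper for four HBDs.
eeig02d = {"1,2-Propanediol": 1.054, "1,4-Butanediol": 1.519,
           "Phenol": 1.521, "Guaiacol": 1.983}

# Hypothetical coefficients for ln(P), T, descriptor, and intercept (synthetic data only).
true_coefficients = [0.9, -0.02, 0.5, 2.0]

features = [[math.log(p), t, d]
            for d in eeig02d.values()
            for p in (1.0, 2.0, 5.0)
            for t in (293.0, 303.0, 313.0, 323.0)]
targets = [sum(c * v for c, v in zip(true_coefficients, row + [1.0]))
           for row in features]

fitted = fit_mlr(features, targets)
```

Because the synthetic targets are noise-free, `fitted` recovers the hypothetical coefficients; with real data, the fitted values would instead be the regression estimates to be validated.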
$$\ln(x) = a \times \ln(P) + b \tag{2}$$
Table 1. The variation ranges of pressure and solubility for each dataset studied in the present work. a T is the temperature in units of K. b Variation domain of P for each HBD; P is the pressure in units of bar. c Variation domain of x for each HBD; x is the solubility of CO2 in DES in units of mole CO2/mole DES. d 1,2-Propanediol, 1,4-Butanediol, 2,3-Butanediol, Diethylene glycol, Guaiacol, Phenol, Triethylene glycol, Furfuryl alcohol, Levulinic acid. e 1,2-Propanediol, 1,4-Butanediol, 2,3-Butanediol, Diethylene glycol, Guaiacol, Phenol, Triethylene glycol. f Furfuryl alcohol, Levulinic acid, Glycerol. g Triethylene glycol, Ethylene glycol, Furfuryl alcohol, Levulinic acid, Urea.
Validation of developed models. The estimation capability of all QSPR models should be assessed by evaluating both internal and external predictive performance. The training set is used for the internal validation, while the test set is used for the external validation. Several statistical parameters can be applied to examine the capability of the constructed QSPR model, including the coefficient of determination (R2), the adjusted coefficient of determination (R2 adj), the standard error (S), the Fisher criterion (F), the root mean square error (RMSE), the leave-one-out cross-validated coefficient of determination (Q2 LOO-CV), and the average absolute relative deviation (AARD%). More details on the statistical parameters are provided in the supporting information file (i.e. Table S3 in the supplementary file). In the present study, both internal and external validation methods have been applied. The outcome of this analysis is presented in the following section.
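Several of the statistical parameters listed above can be computed as in the following sketch. The formulas are the commonly used textbook definitions and are assumed here; the exact expressions adopted by the authors are those in their Table S3. The leave-one-out routine is written for a single-predictor linear model only, to keep the example short, and all function names are hypothetical.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_points, n_predictors):
    """R2 penalized for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def aard_percent(x_exp, x_pred):
    """Average absolute relative deviation in percent (non-logarithmic scale)."""
    return 100.0 / len(x_exp) * sum(abs((e - p) / e) for e, p in zip(x_exp, x_pred))


def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def q2_loo(xs, ys):
    """Leave-one-out Q2: refit without point i, predict point i, accumulate PRESS."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1.0 - press / sum((y - mean_y) ** 2 for y in ys)


# Placeholder values for illustration, not results from this study.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, predicted)
r2_adj = adjusted_r_squared(r2, n_points=4, n_predictors=1)
```

R2, R2 adj, and RMSE describe the fit on the data used, whereas Q2 measures how well each point is predicted by a model that never saw it, which is why both kinds of statistics are reported.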
Notably, the same descriptor (i.e. "EEig02d") appears in all developed models at both fixed and unfixed temperatures. The descriptor "EEig02d" is a molecular descriptor derived from the edge adjacency matrix of a molecule weighted by dipole moments. The "EEig02d" descriptor is related to the molar volume of the molecule 58 .
As can be seen in Table 2, for datasets no. (3)-(10) the best combinations of the ln(P) variable and the selected descriptor were obtained for each fixed temperature (i.e. 293, 303, 313, or 323 K) with the corresponding molar ratio (i.e., 1:3 and 1:4). In addition, models containing three variables (i.e. ln(P), T, and the selected descriptor) were developed for the unfixed temperature datasets.
Validation of the models and statistical evaluation. Following Sarmad et al. 20 , the correlation between ln(x) and ln(P) was tested for each system in every dataset (see Table S4 and Fig. S2 in the supplementary file).
To evaluate the performance of the developed QSPR models, external validation should be performed. First, the data were split into training and test sets using the principal component analysis (PCA) method 59 . According to the PCA, for all datasets the test sets should be chosen so that they contain some new structures compared to the train set.
The experimental versus the predicted values of CO 2 solubility are shown in Figs. 1 and 2 for dataset no. (1) with variable temperature and dataset no. (5) with fixed temperature, respectively. These figures for other datasets can be found in the supporting information file (Figs. S4a-S13a).
As can be seen in Figs. 1a and 2a, the prediction capability of the models using Eqs. (5) and (15) is not acceptable because these models consider only the effect of pressure on CO2 solubility. However, according to Figs. 1b and 2b, taking into account the HBD structural effect in Eqs. (7) and (16)
According to the developed models, the "EEig02d" descriptor is the appropriate structural variable for predicting the solubility of CO2. Because the "EEig02d" descriptor appeared in all models, it can be concluded that it was not selected randomly. The values of CO2 solubility predicted by the QSPR models listed in Table 2 for each data point of all datasets are available in the supporting Excel file. Table 3 presents the outcome of the statistical examination of the constructed models. As can be observed in Table 3, the models including the EEig02d descriptor showed the best statistical parameters on both logarithmic and non-logarithmic scales, considering both internal and external validations.
To investigate the applicability of the unfixed temperature models at new temperatures and pressures, datasets no. (11) and (12) were used. These datasets contain some new HBDs (i.e. glycerol in dataset no. (11), and urea and ethylene glycol in dataset no. (12)). Moreover, both datasets include some new temperatures (i.e., 298 and 333 K) and pressures (i.e. 10 bar) that differ from those in datasets no. (1) and (2), which were used for the model development. According to Fig. S14 in the supplementary Word file, all data points in these two new datasets are within the applicability domain. Therefore, Eqs. (7) and (10) can be applied to datasets no. (11) and (12), respectively. Figure 5 shows the experimental versus predicted values of CO2 solubility for dataset no. (11) using Eq. (7) and dataset no. (12) using Eq. (10). The proposed models showed very good capability for predicting solubility at low pressure (i.e. low solubility). At high pressure (i.e., high solubility), the predicted solubility shows an acceptable deviation, which supports the robustness and applicability of the models at different temperatures and pressures, even for new structures.

Discussion
It should be shown that the selected descriptor has the best performance for the prediction of CO2 solubility. In this regard, some sub-datasets were selected randomly from datasets no. (1) and (2) in such a way that, in each sub-dataset, the temperature, pressure, and molar ratio were almost constant and only the structure of the HBDs varied. Then, models with only one variable (i.e., a structural descriptor) were developed and compared statistically. For instance, Fig. 6 shows the values of R2 and Q2 for one of these sub-datasets, which consists of data at a pressure of approximately 5 bar, a temperature of 313 K, and an HBA to HBD molar ratio of 1:4. The figures corresponding to the other sub-datasets are shown in the supplementary Word file.
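The comparison of single-descriptor models described above can be sketched as a simple screening loop. For a one-variable linear regression, R2 equals the squared Pearson correlation between the descriptor and ln(x), which is what the code computes. The ln(x) values and the second descriptor (`hypothetical_descriptor`) are invented for illustration; only the EEig02d values are taken from this paper, and the 0.6 threshold is the R2 limit quoted with the Golbraikh criterion.

```python
def single_descriptor_r2(descriptor, ln_x):
    """R2 of a one-variable linear fit, i.e. the squared Pearson correlation."""
    n = len(ln_x)
    mean_d = sum(descriptor) / n
    mean_y = sum(ln_x) / n
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in zip(descriptor, ln_x))
    sxx = sum((d - mean_d) ** 2 for d in descriptor)
    syy = sum((y - mean_y) ** 2 for y in ln_x)
    return sxy ** 2 / (sxx * syy)


# Five HBDs at (nearly) constant T, P, and molar ratio; ln(x) values are placeholders.
ln_x = [-4.0, -3.6, -3.5, -3.55, -3.1]

candidates = {
    # 1,2-propanediol, 1,4-butanediol, 2,3-butanediol, phenol, guaiacol (values from this paper).
    "EEig02d": [1.054, 1.519, 1.519, 1.521, 1.983],
    # Invented descriptor that is zero for several HBDs.
    "hypothetical_descriptor": [0.0, 0.0, 1.2, 0.0, 0.3],
}

r2_by_descriptor = {name: single_descriptor_r2(values, ln_x)
                    for name, values in candidates.items()}
passing = [name for name, r2 in r2_by_descriptor.items() if r2 > 0.6]
```

In a full screening, the same loop would run over every candidate descriptor and every sub-dataset, and a cross-validated Q2 would be checked alongside R2.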
As is clear from Fig. 6 and Fig. S15, several models have statistical parameters that satisfy the Golbraikh criterion (R2 > 0.6 and Q2 > 0.5) 60 . The values of the descriptors with acceptable statistical parameters are given in Table 4. The values of some descriptors (i.e. H6m and RDF065u) are zero for several HBDs, which means that these descriptors are not appropriate for the model development because they cannot distinguish between some structures. Apart from this point, it is better to choose a descriptor that is not only repeated in all of the sub-datasets but also has acceptable statistical parameters. Therefore, it is confirmed that the selected descriptor (i.e., EEig02d) is an appropriate molecular descriptor for the developed models.
After model development, the molecular descriptor that appeared in the QSPR models (i.e. "EEig02d") should be interpreted to explain why it is related to CO2 solubility in DESs. The "EEig02d" descriptor, developed by Estrada et al. 58,61 , corresponds to the second eigenvalue of the edge adjacency matrix of the molecule, weighted by the dipole moments of the atoms. The edge adjacency matrix is obtained from the hydrogen-depleted molecular graph, a graph whose nodes correspond to the atoms of the molecule and whose edges correspond to the chemical bonds. Molecular graphs are converted into mathematical expressions such as matrices in order to correlate structure and properties quantitatively. The edge adjacency matrix (EA(G)) of a graph G is defined as follows 62 : For the adjacency matrix of a weighted graph, Eq. (27) should be modified as in Ref. 62 : where e i and e j are the chemical bonds, and K denotes the weights of the edges. Table 5 shows the values of EEig02d along with the molar volume and the molecular structure of all HBDs involved in the datasets.
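The construction of an edge adjacency matrix and the extraction of its second eigenvalue can be illustrated with the sketch below. It uses the standard unweighted definition (two bonds are adjacent when they share an atom) and assumes that eigenvalues are ranked from largest to smallest. The dipole-moment weighting that defines EEig02d is not implemented, so the result is not expected to match the EEig02d values in Table 5. The eigenvalues are obtained with a basic cyclic Jacobi rotation routine, chosen only to keep the example free of external libraries.

```python
import math


def edge_adjacency_matrix(bonds):
    """Unweighted edge adjacency: entry (i, j) is 1 when bonds i and j share an atom."""
    n = len(bonds)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and set(bonds[i]) & set(bonds[j]):
                matrix[i][j] = 1.0
    return matrix


def symmetric_eigenvalues(matrix, max_sweeps=100, tol=1e-24):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, sorted descending."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                if diff == 0.0:
                    theta = math.copysign(math.pi / 4, a[p][q])
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                # Rotate columns p and q, then rows p and q, which zeroes a[p][q].
                for k in range(n):
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


# Hydrogen-depleted graph of 1,4-butanediol: O0-C1-C2-C3-C4-O5 (atom indices are arbitrary).
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
ea = edge_adjacency_matrix(bonds)
eigenvalues = symmetric_eigenvalues(ea)
second_eigenvalue = eigenvalues[1]
```

For this linear skeleton, each bond is adjacent only to its immediate neighbours in the chain, so the matrix is that of a five-vertex path. Branching or ring closure in the HBD changes which bonds are adjacent and therefore shifts the spectrum, which is how the descriptor encodes structure.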
It should be mentioned that the EEig02d descriptor can be related to the molar volume of the molecule 58 .
The values of EEig02d increase with the length of the alkyl chain of the HBD. For example, the value of EEig02d is 1.054 for 1,2-Propanediol, with three carbons in its alkyl chain, and 1.519 for 1,4-Butanediol and 2,3-Butanediol, with four carbons in their alkyl chains. The presence of an ether group also increases the value of the EEig02d descriptor. In this regard, the value of EEig02d for guaiacol is higher than that for phenol (1.983 versus 1.521) because of the ether group in the guaiacol structure. It should be noted that increasing the length of the alkyl chain increases the molecular free volume in the DESs. The presence of ether groups also increases the flexibility of the alkyl chain and thus the free volume, and consequently enhances the solubility of CO2 in the DES because of the physical nature of the absorption (i.e. the free volume mechanism) 16,20 .
Moreover, according to Li et al. 24 , increases in pressure and temperature have a positive and a negative effect on CO2 solubility, respectively. These findings are consistent with the developed models listed in Table 2, since EEig02d and the pressure appear with a positive sign and the temperature appears with a negative sign. The enhancement of CO2 solubility with increasing alkyl chain length has also been demonstrated experimentally.

Conclusion
In the current study, the QSPR approach was employed to develop linear models for predicting CO2 solubility in DESs. The main aim was to investigate the effect of the HBD structure on the solubility of CO2 in ChCl-based DESs. The main findings are as follows:
• The same descriptor (i.e. EEig02d), along with ln(P), appeared in all developed models, independent of the effect of temperature. The EEig02d descriptor is related to the molar volume and dipole moment of a molecule. Examination of the models indicated that the solubility increases with increasing values of the EEig02d descriptor because there is a direct relationship between physical absorption and the free volume of the molecule.
• Two general models for HBA to HBD molar ratios of 1:3 and 1:4 were constructed by combining ln(P), T, and EEig02d as the structural descriptor to predict CO2 solubility in ChCl-based DESs at any desired temperature. These models were examined by further external validation using two additional datasets containing new HBD structures.
• This study provided reliable and simple QSPR models for predicting CO2 solubility in ChCl-based DESs, which can be applied in the preliminary screening of DESs in PCC processes.